Structural Genomics Analysis: Phylogenetic Patterns of Unique, Shared, and Common Folds in 20 Genomes

Authors

  • Hedi Hegyi
  • Jimmy Lin
  • Mark Gerstein
Abstract

We carried out a structural-genomics analysis of the folds in the first 20 completely sequenced genomes, focusing on the patterns of fold usage. We assigned folds to sequences using PSI-blast, run with a systematic protocol to reduce the amount of computational overhead. On average, folds could be assigned to about a fourth of the ORFs in the genomes and about a fifth of the amino acids in the proteomes. More than 80% of all the folds in the scop structural classification were identified in at least one of the 20 organisms, with worm and E. coli having the largest number of distinct folds. Folds are particularly effective at comprehensively measuring levels of gene duplication, as they group together even very remote homologues. Using folds, we find that the average level of duplication varies with the complexity of the organism, ranging from 2.4 in M. genitalium to 32 in the worm, values significantly higher than those observed based purely on sequence similarity. We rank the common folds in the 20 organisms, finding that the top three folds are the P-loop NTP hydrolase, the ferredoxin fold, and the TIM-barrel. We also discuss in detail the many factors that affect and bias these rankings. From the overall patterns of shared folds, we were able to group the 20 organisms into a whole-genome tree, which is similar but not identical to the classic ribosomal tree. We also focus on specific patterns of fold (and fold pair) occurrence in the genomes, associating some of them with instances of horizontal transfer and others with gene loss. In particular, we find three possible examples of transfer between archaea and bacteria and six between eukarya and bacteria. We make available our detailed results at the following URL: http://bioinfo.mbb.yale.edu/genome/20.

INTRODUCTION

Structural genomics, which combines structural biology with genomics, is emerging as a new sub-discipline. Its central concept is mapping the whole protein structure space, i.e. determining the complete protein-fold "parts list." Estimates for the number of naturally occurring folds run somewhere between 1,000 and 10,000 [1-3], and the current structural classifications divide the known structures into ~500 known folds [4-6]. Large-scale sequence analysis of structural domains in completely sequenced microbial and eukaryotic genomes will affect both the set of proteins to be selected for experimental high-throughput structure determination and the biological conclusions we eventually draw from the massive amount of experimental work. It is timely, therefore, to perform such an analysis by comparing the sequences of the currently completed genomes to those of the already resolved and classified structural domains.

Here, we survey the patterns of fold usage in the first 20 completely sequenced genomes, in the manner of a demographic census. This enables us to identify unique folds, which are potentially antibiotic targets in pathogens; shared folds, which provide information on evolutionary relatedness; common folds, which may be generic scaffolds; and overall patterns of fold usage, which may reveal aspects of protein structure and evolution beyond those found by sequence similarity. We also survey the level of gene duplication implied by the sharing of the same fold by many genes, finding that it varies greatly between genomes.

Our work follows upon previous (mostly smaller-scale) surveys of the occurrence of folds in genomes [7-11] and much work on assigning folds to genomes as comprehensively as possible [12-19]. It also relates to a number of previous analyses in more general areas of genomics. One goal of large-scale genome analysis is to study the evolution of completely sequenced organisms by deciphering their genetic makeup through identifying orthologs and paralogs in their genomes [20].
These studies also provide information about the conserved core of the genomes, which is necessary to the basic cellular functions of all bacteria, archaea and eukaryotes. Another interesting aspect of evolution is the relatively high frequency with which these primitive organisms incorporate foreign genes into their genomes, i.e. horizontal gene transfer [21]. These horizontally transferred genes can represent new folds in the organism and provide a possible mechanism for an organism to acquire a new "part". Analyzing a large number of closely related genomes helps to clarify this issue with greater certainty than in the past [22]. Large-scale genome comparison has also provided a glimpse into the evolutionary process of genome degradation in parasitic microorganisms [23].

Another goal of genomics is to study biological function on a large scale in terms of the functions of many proteins. Recent success in assigning a function to a novel protein based merely on its structure (i.e. guessing what a part does from its shape) suggests that structural genomics might be useful in this endeavor. For example, Stawiski et al. identified several novel proteases based purely on their unique structural features [24], and Eisenstein et al. outlined a strategy to characterize 65 novel H. influenzae proteins through high-throughput crystallography [25].

In terms of functional assignment, there has been much recent progress based on comparing phylogenetic profiles of different gene products. These studies predict the function of an uncharacterized protein based on its consistent appearance with a protein of known function in the same genomes. Eisenberg and co-workers studied correlated evolution using phylogenetic profiles derived from 16 completely sequenced genomes, and used these, in addition to patterns of domain fusion, to identify functionally related proteins [26, 27]. Enright et al. followed a similar approach and identified several unique fusion events by comparing the complete genomes of two bacteria and an archaeon [28]. Reflecting the great amount of experimental functional information available for E. coli, this organism's genome has been studied in considerable detail in terms of functional prediction and structure-function relationships [29-32].

Finally, genomics is also driven by practical goals, such as the need to discover new antibiotics to treat emerging antibiotic-resistant bacteria. Genes that are conserved in several microbial genomes but are missing from eukaryotic and archaeal genomes would be ideal targets for broad-spectrum antibiotics [33]. Another approach is to identify species-specific genes with unique structures to reveal organism-specific biochemical pathways. Such genes are suspected to play a role in the pathogenicity of the bacteria [34] and could be used to develop antibiotics against specific pathogens.

Materials and Methods

Specific Databases Used in the Sequence Comparisons

Table 1A lists the 20 genomes we analyzed, their phylogenetic classifications, and their sizes. They represent all three domains of life (Archaea, Bacteria and Eukaryota). Nineteen of the 20 are single-celled organisms, one of which is a eukaryote (yeast); their genome sizes range from 479 ORFs (M. genitalium) to 6218 ORFs (yeast). The only metazoan of the twenty, C. elegans, has ~19000 ORFs, and the average genome size, which we denote by G below, is 2179.

[Table 1A: "Organisms + ORF AA Coverage"]

We compared the amino acid sequences of the structural domains in the SCOP classification of protein structures [4] to the sequences of the 20 genomes. (Specifically, we used a clustered version of the scop database 1.39, called pdb95d, as queries. This contains 3266 distinct representative sequences, which we denote as P.) For the PSI-blast runs we also used a 90% non-redundant protein database, NRDB90 [35], in our comparisons.
The version we used is from December 1999 and contains 195,866 sequences (denoted as N). Both the databases (NRDB and the genome sequences) and the query sequences (scop domains) were masked with the SEG program, using standard parameters, to mask low-complexity regions [36, 37].

Fold Assignment by PSI-BLAST, Development of a Fast Hybrid Protocol

One of the goals of this work was to develop a simple, robust approach for automatically using PSI-blast [38] to do fold assignments to genomes in bulk. For all our PSI-blast runs we used an inclusion threshold (h) of 10, a number of iterations (j) of 10, and a final match threshold of 10. These parameters are considerably more conservative than those used in a number of recent analyses [14, 15, 39-41]. We used these parameters because we intended that our fold assignments run in a highly automated fashion and we wanted to guard against false positives that would not be caught by manual checking. Furthermore, while PSI-blast, with proper masking for low-complexity regions, is known to be quite robust, the iterations occasionally do go out of control with fairly liberal parameter choices (particularly the inclusion threshold h), and we wished to specifically guard against this. Moreover, since we varied the size of the databases used in a variety of the runs (see below), we wanted to ensure that our parameter choices resulted in significant matches in any of the databases used. We performed our PSI-blast comparisons in a number of ways:

(i) Default Protocol

We concatenated the sequences of a genome onto NRDB and used PSI-blast to run the scop domains as queries against them. This is the "default" way to run PSI-blast. However, it has the drawback that every time one adds a new genome to the analysis, even a small one, one has to re-run each scop domain against the new genome and all of NRDB, a computationally intensive process.
That is, each genome requires approximately (N+G)PK pairwise comparisons, where K is the average number of iterations required by a PSI-blast comparison. (K obviously depends on many factors, including various biases both in the target database and the query, but for rough reckoning we can estimate it at j/2 = 5.) This is a very rough number, which we use below for illustrative purposes. Using the values above, it comes out to ~3.2 billion (3,234,074,850).

(ii) NRDB PSI-blast Profiles

We ran each scop query against NRDB to generate a PSI-blast profile, giving us a profile for each scop fold and superfamily. Then we re-ran these against the genomes without iteration, using a match threshold of 10. (Note that because we use very conservative choices for the inclusion threshold in building up the original PSI-blast profiles, at this stage we can confidently assume that the final match threshold of 10 is selecting sequences truly similar to our original scop domain queries.) Note also that this is potentially a much more efficient process, since when one analyzes a new genome one need only run the profiles against each genome sequence once. That is, each new genome requires GP comparisons. (There is no K factor since there is no iteration.) Plugging in the numbers above, we get ~7.1 million (7,116,614).

(iii) Intra-genome Profiles

A problem with the above approach is that often the proteins that contribute most to the PSI-blast profile for a given query are in the same organism as the query. This could result, for instance, if one is searching for a protein in a family that is highly duplicated in one organism but otherwise does not have a wide phylogenetic distribution. Thus, given a new genome with a highly duplicated family, one could potentially compromise sensitivity by using solely NRDB-generated profiles. (This would not be a problem in the default approach, since one would include the genome with NRDB in making up the profiles.) To get around this, while still retaining some computational efficiency for each new genome, we tried running each scop domain query against the genome alone with PSI-blast. For this protocol, each new genome requires GKP comparisons, which evaluates to ~36 million (35,583,070), assuming, of course, the same value for K as above, which is only approximately true.

(iv) Hybrid Protocol

For a number of select genomes, in particular M. genitalium, yeast and worm, we carefully compared the matches resulting from the above three protocols. We found that for the larger genomes, such as worm, use of the intra-genome profiles (protocol iii) generated quite a few additional matches beyond those found by the straight NRDB profiles (ii). In particular, using the intra-genome protocol for the worm we found 501 extra matches that were not found by the NRDB profiles (while the NRDB profiles found 576 matches that the intra-genome protocol did not find). Combining the matches from the NRDB profiles and the intra-genome profiles (protocols ii and iii) into a new hybrid protocol resulted in essentially the same set of matches as the default PSI-blast protocol (i). For instance, for M. genitalium, the hybrid protocol produced at least one match for 163 different ORFs of the 483 total ORFs, whereas the default protocol produced matches for 161 different ORFs. These numbers are very similar to the values found in other PSI-blast analyses [14, 15, 21, 39]. Moreover, for a new genome the hybrid protocol was considerably more efficient than the default method: 7.1 + 35.6 vs. 3,234 million comparisons, about 75 times fewer comparisons using the numbers above.

To make the results of the various protocols completely clear, we make available on the web sets of matches resulting from running with the three protocols. See http://bioinfo.mbb.yale.edu/genomes/20.
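
The rough comparison counts quoted for the four protocols can be reproduced in a few lines of Python. This is only an illustrative sketch: the function names are ours, the constants N, G, P and K are the values given in the text, and the `merge_matches` helper assumes (hypothetically) that each match is recorded as an (ORF id, scop domain id) pair. It is not the pipeline actually used for the analysis.

```python
"""Rough cost model for the PSI-blast protocols, plus a toy hybrid merge."""

N = 195_866  # sequences in NRDB90 (value from the text)
G = 2_179    # average genome size in ORFs (value from the text)
P = 3_266    # scop domain query sequences (value from the text)
K = 5        # assumed average number of iterations, j/2 (value from the text)


def default_cost(n=N, g=G, p=P, k=K):
    """Protocol (i): iterate every query against NRDB plus the genome."""
    return (n + g) * p * k


def nrdb_profile_cost(g=G, p=P):
    """Protocol (ii): one non-iterated pass of each profile over the genome."""
    return g * p


def intra_genome_cost(g=G, p=P, k=K):
    """Protocol (iii): iterate every query against the genome alone."""
    return g * k * p


def hybrid_cost(g=G, p=P, k=K):
    """Protocol (iv): cost of running protocols (ii) and (iii) together."""
    return nrdb_profile_cost(g, p) + intra_genome_cost(g, p, k)


def merge_matches(nrdb_hits, intra_hits):
    """Union two sets of (orf_id, domain_id) matches and show what each adds.

    The tuple layout is a hypothetical representation chosen for this sketch.
    """
    return {
        "hybrid": nrdb_hits | intra_hits,
        "only_nrdb": nrdb_hits - intra_hits,
        "only_intra": intra_hits - nrdb_hits,
    }


if __name__ == "__main__":
    print(f"(i)   default:      {default_cost():,}")
    print(f"(ii)  NRDB profile: {nrdb_profile_cost():,}")
    print(f"(iii) intra-genome: {intra_genome_cost():,}")
    print(f"(iv)  hybrid:       {hybrid_cost():,}")
    print(f"default / hybrid:   {default_cost() / hybrid_cost():.1f}")
```

Because K is only a rough estimate, the ratio between the default and hybrid costs should be read as an order-of-magnitude guide rather than a measured speed-up.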
Note also that since in our hybrid protocol we are "mixing" databases for the comparisons, the precise e-values for each comparison are not exactly comparable. This is another reason for the very conservative choices we made above for our PSI-blast thresholds.

Fold Assignment by FASTA, a Benchmark

As a further benchmark comparison, we ran the scop domains directly against the genomes using fasta with a standard 0.01 e-value cutoff [42-44]. It is known that simple pairwise comparison with either fasta or blastp is considerably less sensitive than profile search with PSI-blast, so we did not expect this to add substantially to the number of matches that we found. However, we elected to perform the fasta searches because for certain small, compositionally biased proteins the PSI-blast profiles may not be effective [39, 41]. Also, we felt that these would be a useful benchmark for comparison against PSI-blast. As expected, we found only a very small number of additional matches with fasta. For instance, for the worm, the combination of the PSI-blast approaches produced at least one match for 4556 ORFs of the 19099. Fasta added only 30 matches to these, considerably less than 1%, and, of course, it missed 1553 of the matches.

Tabulation in Terms of Scop Folds and Superfamilies

Using the SCOP scheme, we tabulated our results in terms of distinct folds and structural superfamilies. In scop, for structures to have the same fold it is necessary for them to have the same overall core topology and geometric disposition of secondary structures. In contrast, a superfamily is a subset of the fold, denoting groups of proteins that have closer structural similarity and consequently probably share an evolutionary relationship [4].
We will report our specific results here separately in terms of both scop folds and structural superfamilies; however, in the text it is awkward to refer constantly to "scop folds and structural superfamilies", so sometimes we will loosely use the term "fold" to stand for both scop fold and superfamily. For instance, we will use the terms "fold assignment" and patterns of "fold occurrence" to refer to general ideas that are equally applicable to scop structural superfamilies and to scop folds.
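
To make the two levels of tabulation concrete, the following Python sketch counts how many distinct ORFs were assigned to each fold and to each superfamily. The record layout (ORF id, fold label, superfamily label), the function name `tabulate`, and the toy labels such as "fold_A" are assumptions made for illustration; they are not the identifiers or the code used in the analysis.

```python
from collections import Counter, defaultdict


def tabulate(matches):
    """Count distinct ORFs per fold and per (fold, superfamily).

    `matches` is an iterable of (orf_id, fold, superfamily) tuples; this
    layout is a hypothetical one chosen for the sketch. Repeated matches of
    the same ORF to the same fold are counted once.
    """
    orfs_by_fold = defaultdict(set)
    orfs_by_superfamily = defaultdict(set)
    for orf_id, fold, superfamily in matches:
        orfs_by_fold[fold].add(orf_id)
        # A superfamily is a subset of its fold, so key it by both labels.
        orfs_by_superfamily[(fold, superfamily)].add(orf_id)
    fold_counts = Counter({f: len(o) for f, o in orfs_by_fold.items()})
    sf_counts = Counter({s: len(o) for s, o in orfs_by_superfamily.items()})
    return fold_counts, sf_counts


if __name__ == "__main__":
    # Toy data with made-up labels, not results from the paper.
    toy = [
        ("orf1", "fold_A", "sf_A1"),
        ("orf2", "fold_A", "sf_A1"),
        ("orf3", "fold_A", "sf_A2"),
        ("orf3", "fold_B", "sf_B1"),
        ("orf1", "fold_A", "sf_A1"),  # duplicate match, counted once
    ]
    folds, superfamilies = tabulate(toy)
    print("ORFs per fold:", folds.most_common())
    print("ORFs per superfamily:", superfamilies.most_common())
```

Sorting the per-fold counts, as `most_common()` does here, gives the kind of ranking of common folds described in the abstract, and the same function applies unchanged at the superfamily level.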

Publication date: 2001